{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "71a329f0",
   "metadata": {},
   "source": [
    "# Standard instruction for using LMI container on SageMaker\n",
    "In this tutorial, you will use LMI container from DLC to SageMaker and run inference with it.\n",
    "\n",
    "Please make sure the following permission granted before running the notebook:\n",
    "\n",
    "- S3 bucket push access\n",
    "- SageMaker access\n",
    "\n",
    "## Step 1: Let's bump up SageMaker and import stuff"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "67fa3208",
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install sagemaker boto3 awscli --upgrade  --quiet"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ec9ac353",
   "metadata": {},
   "outputs": [],
   "source": [
    "import boto3\n",
    "import sagemaker\n",
    "from sagemaker import Model, image_uris, serializers, deserializers, multidatamodel\n",
    "\n",
    "role = sagemaker.get_execution_role()  # execution role for the endpoint\n",
    "sess = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs\n",
    "region = sess._region_name  # region name of the current SageMaker Studio environment\n",
    "account_id = sess.account_id()  # account_id of the current SageMaker Studio environment"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "81deac79",
   "metadata": {},
   "source": [
    "## Step 2: Start preparing model artifacts\n",
    "In LMI contianer, we expect some artifacts to help setting up the model\n",
    "- serving.properties (required): Defines the model server settings\n",
    "- model.py (optional): A python file to define the core inference logic\n",
    "- requirements.txt (optional): Any additional pip wheel need to install"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "19d6798b",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%writefile model.py\n",
    "from djl_python import Input, Output\n",
    "import os\n",
    "import torch\n",
    "from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer\n",
    "\n",
    "predictor = None\n",
    "\n",
    "def get_model(properties):\n",
    "    model_name = properties['model_id']\n",
    "    local_rank = int(os.getenv('LOCAL_RANK', '0'))\n",
    "    dtype = torch.float16\n",
    "    model = AutoModelForCausalLM.from_pretrained(model_name, low_cpu_mem_usage=True, torch_dtype=dtype)\n",
    "    tokenizer = AutoTokenizer.from_pretrained(model_name)\n",
    "    generator = pipeline(task='text-generation', model=model, tokenizer=tokenizer, device=local_rank)\n",
    "    return generator\n",
    "\n",
    "\n",
    "def handle(inputs: Input) -> None:\n",
    "    global predictor\n",
    "    if not predictor:\n",
    "        predictor = get_model(inputs.get_properties())\n",
    "\n",
    "    if inputs.is_empty():\n",
    "        # Model server makes an empty call to warmup the model on startup\n",
    "        return None\n",
    "\n",
    "    data = inputs.get_as_json()['prompt']\n",
    "    result = predictor(data, do_sample=True)\n",
    "    return Output().add(result)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "671ea485",
   "metadata": {},
   "outputs": [],
   "source": [
    "import shutil\n",
    "import os\n",
    "models_to_run=[\"facebook/opt-350m\", \"bigscience/bloomz-560m\", \"EleutherAI/gpt-neo-125M\", \"cerebras/Cerebras-GPT-590M\"]\n",
    "model_folders = [model.split(\"/\")[1].lower() for model in models_to_run]\n",
    "\n",
    "for folder, model in zip(model_folders, models_to_run):\n",
    "    if os.path.exists(folder):\n",
    "        shutil.rmtree(folder)\n",
    "    os.makedirs(folder)\n",
    "    with open(os.path.join(folder, \"serving.properties\"), \"w\") as f:\n",
    "        f.write(f\"engine=Python\\noption.model_id={model}\\n\")\n",
    "    shutil.copyfile(\"model.py\", f\"{folder}/model.py\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b2e9949c",
   "metadata": {},
   "source": [
    "### DJLServing memory management for MME\n",
    "\n",
    "In DJLServing, you could control how many memory allocated for each CPU/GPU on SageMaker. It works like below:\n",
    "\n",
    "- `required_memory_mb` CPU/GPU required memory in MB\n",
    "- `reserved_memory_mb` CPU/GPU reserved memory for computation\n",
    "- `gpu.required_memory_mb` GPU required memory in MB\n",
    "- `gpu.reserved_memory_mb` GPU reserved memory for computation\n",
    "\n",
    "If you need 20GB CPU memory and 2GB GPU memory, you could set\n",
    "\n",
    "```\n",
    "required_memory_mb=20480\n",
    "gpu.required_memory_mb=2048\n",
    "```\n",
    "\n",
    "in the following code, we will create a bomb model that plans to take over all GPU memory and let's see how that would impact the result. For more information on settings, please find them [here](https://docs.djl.ai/docs/serving/serving/docs/modes.html#servingproperties)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5992c8d1",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%writefile serving.properties\n",
    "engine=Python\n",
    "option.model_id=facebook/opt-350m\n",
    "gpu.reserved_memory_mb=30000"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b0142973",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%sh\n",
    "cp -r opt-350m/ bomb/\n",
    "mv serving.properties bomb/\n",
    "tar czvf opt-350m.tar.gz opt-350m/\n",
    "tar czvf bloomz-560m.tar.gz bloomz-560m/\n",
    "tar czvf gpt-neo-125m.tar.gz gpt-neo-125m/\n",
    "tar czvf cerebras-gpt-590m.tar.gz cerebras-gpt-590m/\n",
    "tar czvf bomb.tar.gz bomb/\n",
    "rm -rf opt-350m/ bloomz-560m/ gpt-neo-125m/ cerebras-gpt-590m/ bomb/ model.py"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2e58cf33",
   "metadata": {},
   "source": [
    "## Step 3: Start building SageMaker endpoint\n",
    "In this step, we will build SageMaker endpoint from scratch"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4d955679",
   "metadata": {},
   "source": [
    "### Getting the container image URI\n",
    "\n",
    "Available framework are:\n",
    "- djl-deepspeed (0.20.0, 0.21.0)\n",
    "- djl-fastertransformer (0.21.0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7a174b36",
   "metadata": {},
   "outputs": [],
   "source": [
    "image_uri = image_uris.retrieve(\n",
    "        framework=\"djl-deepspeed\",\n",
    "        region=sess.boto_session.region_name,\n",
    "        version=\"0.21.0\"\n",
    "    )"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "11601839",
   "metadata": {},
   "source": [
    "### Upload artifact on S3 and create SageMaker model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "38b1e5ca",
   "metadata": {},
   "outputs": [],
   "source": [
    "s3_code_prefix = \"large-model-lmi/code\"\n",
    "bucket = sess.default_bucket()  # bucket to house artifacts\n",
    "for model_name in model_folders:\n",
    "    code_artifact = sess.upload_data(f\"{model_name}.tar.gz\", bucket, s3_code_prefix)\n",
    "    print(f\"S3 Code or Model tar ball uploaded to --- > {code_artifact}\")\n",
    "env = {\"HUGGINGFACE_HUB_CACHE\": \"/tmp\", \"TRANSFORMERS_CACHE\": \"/tmp\"}\n",
    "model_s3_folder = os.path.dirname(code_artifact) + \"/\"\n",
    "\n",
    "model = multidatamodel.MultiDataModel(\"LMITestModel\", model_s3_folder, image_uri=image_uri, env=env, role=role)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "004f39f6",
   "metadata": {},
   "source": [
    "### 4.2 Create SageMaker endpoint\n",
    "\n",
    "You need to specify the instance to use and endpoint names"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8e0e61cd",
   "metadata": {},
   "outputs": [],
   "source": [
    "instance_type = \"ml.g5.2xlarge\"\n",
    "endpoint_name = sagemaker.utils.name_from_base(\"lmi-model\")\n",
    "\n",
    "model.deploy(initial_instance_count=1,\n",
    "             instance_type=instance_type,\n",
    "             endpoint_name=endpoint_name,\n",
    "             # container_startup_health_check_timeout=3600\n",
    "            )\n",
    "\n",
    "# our requests and responses will be in json format so we specify the serializer and the deserializer\n",
    "predictor = sagemaker.Predictor(\n",
    "    endpoint_name=endpoint_name,\n",
    "    sagemaker_session=sess,\n",
    "    serializer=serializers.JSONSerializer(),\n",
    "    deserializer=deserializers.JSONDeserializer(),\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bb63ee65",
   "metadata": {},
   "source": [
    "## Step 5: Test and benchmark the inference"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2bcef095",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(predictor.predict( {\"prompt\": \"Large model inference is\"}, target_model=\"opt-350m.tar.gz\"))\n",
    "print(predictor.predict({\"prompt\": \"Large model inference is\"}, target_model=\"bloomz-560m.tar.gz\"))\n",
    "print(predictor.predict({\"prompt\": \"Large model inference is\"}, target_model=\"gpt-neo-125m.tar.gz\"))\n",
    "print(predictor.predict({\"prompt\": \"Large model inference is\"}, target_model=\"cerebras-gpt-590m.tar.gz\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "573bac44",
   "metadata": {},
   "source": [
    "### Testing a bomb model\n",
    "\n",
    "Now let's see if I have a model need 30GB GPU memory and what will happen:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "07d7f765",
   "metadata": {},
   "outputs": [],
   "source": [
    "code_artifact = sess.upload_data(f\"bomb.tar.gz\", bucket, s3_code_prefix)\n",
    "print(f\"S3 Code or Model tar ball uploaded to --- > {code_artifact}\")\n",
    "try:\n",
    "    predictor.predict({\"prompt\": \"Large model inference is\"}, target_model=\"bomb.tar.gz\")\n",
    "except Exception as e:\n",
    "    print(\"Loading failed...You can still load more models that are smaller than the gpu sizes\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c1cd9042",
   "metadata": {},
   "source": [
    "The model loading failed since the total GPU memory is 24GB and cannot holds a 30GB model. You will find the model server is still alive. Behind the scence, SageMaker will unload all models to spare spaces. So currently there is no model loaded. You could rerun the 4 prediction above and model server will reload the model back again.\n",
    "\n",
    "## Clean up the environment"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3d674b41",
   "metadata": {},
   "outputs": [],
   "source": [
    "sess.delete_endpoint(endpoint_name)\n",
    "sess.delete_endpoint_config(endpoint_name)\n",
    "model.delete_model()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
